Skip to content

FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568

Merged
gargsaumya merged 6 commits into
mainfrom
dev/saurabh/fix-565-gil-release-conn-attrs
May 14, 2026
Merged

FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568
gargsaumya merged 6 commits into
mainfrom
dev/saurabh/fix-565-gil-release-conn-attrs

Conversation

@saurabh500
Copy link
Copy Markdown
Contributor

@saurabh500 saurabh500 commented May 11, 2026

Work Item / Issue Reference

GitHub Issue: #565


Summary

Fixes the indefinite hang of mssql_python.connect() when the target server is reached through an in-process SSH tunnel created with paramiko + sshtunnel (issue #565, originally reported as #491).

Root cause

PR #541 released the GIL around SQLDriverConnect, SQLDisconnect, and SQLEndTran, but a handful of SQLSetConnectAttr call sites in mssql_python/pybind/connection/connection.cpp were left holding the GIL:

  • Connection::setAutocommitSQL_ATTR_AUTOCOMMIT. This is the call that hangs in the issue's repro: it runs by default immediately after every connect().
  • Connection::setAttribute (int / wide-string / bytes paths) — user-facing set_attr().
  • Connection::resetSQL_ATTR_RESET_CONNECTION and SQL_ATTR_TXN_ISOLATION, called by the connection pool on acquire.

When the GIL is held during a call that ends up doing blocking work, paramiko's transport thread cannot run and so cannot forward bytes through the SSH channel — the driver waits forever for a response that never comes. pyodbc does not exhibit this because it releases the GIL around its blocking ODBC calls.

Fix

Wrap each of these call sites in py::gil_scoped_release, matching the pattern PR #541 already established for SQLDriverConnect / SQLEndTran. Local-only attribute reads (SQLGetConnectAttr for SQL_ATTR_CONNECTION_DEAD / SQL_ATTR_AUTOCOMMIT, SQLGetInfo, SQLAllocHandle) are intentionally left alone.

Additional fix: Connection pool mutex/GIL deadlock

The initial GIL-release change in Connection::reset() introduced a secondary deadlock in ConnectionPool::acquire(). The pool's acquire() method called conn->isAlive() and conn->reset() while holding the pool _mutex. With reset() now releasing the GIL internally, this created a classic ABBA lock-ordering inversion:

Thread A Thread B
Holds _mutex, calls reset() Holds GIL, waiting on _mutex
reset() releases GIL, ODBC call Gets nothing (blocked on mutex)
ODBC returns, tries to reacquire GIL — blocked Still blocked on _mutex
Deadlock: holds mutex, needs GIL Deadlock: holds GIL, needs mutex

The fix restructures ConnectionPool::acquire() to pop one candidate connection from the pool under the mutex, then validate it (isAlive() + reset()) outside the mutex. This follows the same pattern already established in the pool for connect() (Phase 2.5) and disconnect() (Phase 3/4), where blocking ODBC calls that release the GIL are performed without holding the pool lock.

Why this is safe:

  1. No data races on _pool: The candidate is popped from the deque atomically under the mutex before any validation. No other thread can access the same connection object.
  2. _current_size stays accurate: Stale connections are decremented during pruning (Phase 1). Failed candidates are decremented individually after validation. New connections increment the count under the mutex before connect() is called.
  3. Pool size limit is respected: The "reserve a new slot" check (_current_size < _max_size) still happens under the mutex, after confirming no viable candidates remain.
  4. Exception safety: If isAlive() or reset() throws, the connection is caught and treated as dead — it goes to the disconnect list and _current_size is decremented.
  5. Single-threaded behavior unchanged: With one thread, the loop simply pops, validates, and returns — identical to the previous while-loop but without holding the mutex during ODBC calls.

Verification

  • Reproduced Reopen Issue #491: SSH tunneling fails using paramiko+sshtunnel #565 locally with an in-process Python TCP forwarder (simulating paramiko/sshtunnel). Before the GIL-release fix, mssql_python.connect() hung indefinitely; after the fix it completes in ~0.3s.
  • The pool deadlock was reproduced by returning connections to the pool and having 2+ threads call connect() concurrently — this hung indefinitely before the pool fix and completes instantly after.
  • Full test suite: 1707 passed, 64 skipped, 0 failures (~2 minutes).

Diff: 2 files changed — connection.cpp (+42/-12 GIL release) and connection_pool.cpp (+39/-26 mutex restructure).

@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 74814ca to 204849c Compare May 11, 2026 17:45
@github-actions github-actions Bot added pr-size: small Minimal code update pr-size: medium Moderate update size labels May 11, 2026
Comment thread tests/test_023_ssh_tunnel_gil_release.py Fixed
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 827eea5 to 92a568b Compare May 11, 2026 18:35
@github-actions github-actions Bot removed the pr-size: small Minimal code update label May 11, 2026
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from 92a568b to 0f31307 Compare May 11, 2026 18:42
Comment thread tests/test_023_ssh_tunnel_gil_release.py Dismissed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 12, 2026

📊 Code Coverage Report

🔥 Diff Coverage

78%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 6987 out of 27097
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (80.0%): Missing lines 346-348,350-354
  • mssql_python/pybind/connection/connection_pool.cpp (76.9%): Missing lines 70,85,90-93

Summary

  • Total: 66 lines
  • Missing: 14 lines
  • Coverage: 78%

mssql_python/pybind/connection/connection.cpp

Lines 342-358

  342         try {
  343             // Copy to a stack-local buffer so that releasing the GIL
  344             // doesn't expose a race where another thread overwrites the
  345             // member strBytesBuffer while the driver reads from ptr.
! 346             std::string localBytesBuffer = value.cast<std::string>();
! 347             SQLPOINTER ptr = const_cast<char*>(localBytesBuffer.c_str());
! 348             SQLINTEGER length = static_cast<SQLINTEGER>(localBytesBuffer.size());
  349 
! 350             SQLRETURN ret;
! 351             {
! 352                 py::gil_scoped_release release;
! 353                 ret = SQLSetConnectAttr_ptr(_dbcHandle->get(), attribute, ptr, length);
! 354             }
  355             if (!SQL_SUCCEEDED(ret)) {
  356                 LOG("Failed to set binary attribute=%d, ret=%d", attribute, ret);
  357             } else {
  358                 LOG("Set binary attribute=%d successfully (length=%d)", attribute, length);

mssql_python/pybind/connection/connection_pool.cpp

Lines 66-74

  66                     // slot will open up — but we can't wait for it here without
  67                     // adding a condition-variable retry loop.  This is an
  68                     // acceptable trade-off: transient "pool full" errors under
  69                     // heavy contention are rare and callers can retry.
! 70                     throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
  71                 }
  72                 break;
  73             }
  74             candidate = _pool.front();

Lines 81-97

  81                 valid_conn = candidate;
  82                 break;
  83             }
  84         } catch (const std::exception& ex) {
! 85             LOG("Candidate connection validation failed: %s", ex.what());
  86         }
  87 
  88         // Candidate is dead or reset failed — mark for disconnect and
  89         // decrement the pool size.
! 90         to_disconnect.push_back(candidate);
! 91         {
! 92             std::lock_guard<std::mutex> lock(_mutex);
! 93             if (_current_size > 0) --_current_size;
  94         }
  95     }
  96 
  97     // Phase 3: Connect the new connection outside the mutex.


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@saurabh500 saurabh500 marked this pull request as ready for review May 12, 2026 23:49
Copilot AI review requested due to automatic review settings May 12, 2026 23:49
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a GIL-related deadlock where mssql_python.connect() could hang indefinitely when the connection’s network path is mediated by another in-process Python thread (e.g., paramiko/sshtunnel). It does so by releasing the GIL around blocking SQLSetConnectAttr calls, and adjusts connection-pool acquisition to avoid a mutex/GIL lock-order inversion when reset() now releases the GIL.

Changes:

  • Release the Python GIL around potentially blocking SQLSetConnectAttr call sites (autocommit, user set_attr() paths, and pooled reset paths).
  • Restructure ConnectionPool::acquire() to validate candidate connections outside the pool mutex to prevent mutex/GIL deadlocks.
  • Add a Linux-only functional regression test that reproduces the deadlock via a pure-Python in-process TCP forwarder executed in a subprocess with a hard watchdog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
mssql_python/pybind/connection/connection.cpp Releases GIL around blocking SQLSetConnectAttr calls to prevent tunnel-thread starvation deadlocks.
mssql_python/pybind/connection/connection_pool.cpp Refactors acquire path to avoid holding the pool mutex across ODBC calls that may release/reacquire the GIL.
tests/test_023_ssh_tunnel_gil_release.py Adds a subprocess-based watchdog regression test reproducing the hang scenario.
Comments suppressed due to low confidence (1)

mssql_python/pybind/connection/connection_pool.cpp:62

  • In acquire(), a thread can pop a candidate for validation (outside the mutex) while another thread sees the pool empty with _current_size == _max_size and immediately throws pool size limit reached. If the in-flight candidate later fails validation and decrements _current_size, the second thread has already failed even though capacity would have become available. Consider tracking “in validation” candidates (or temporarily reserving/releasing capacity) so that exhaustion is decided after all popped candidates are conclusively validated, not while a candidate is being checked outside the lock.
    while (true) {
        std::shared_ptr<Connection> candidate;
        {
            std::lock_guard<std::mutex> lock(_mutex);
            if (_pool.empty()) {
                // No more candidates — try to reserve a slot for a new connection.
                if (_current_size < _max_size) {
                    valid_conn = std::make_shared<Connection>(connStr, true);
                    ++_current_size;
                    needs_connect = true;
                } else {
                    throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
                }
                break;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/pybind/connection/connection_pool.cpp
Comment thread tests/test_023_ssh_tunnel_gil_release.py Outdated
gargsaumya
gargsaumya previously approved these changes May 13, 2026
Copy link
Copy Markdown
Contributor

@gargsaumya gargsaumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Copy Markdown
Collaborator

@bewithgaurav bewithgaurav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spotted a buffer race on the windows wide-string path in setAttribute. ptr aims into the shared wstrStringBuffer member field and the new GIL release lets a second thread sharing the connection re-enter setAttribute and reallocate the buffer mid-call. the field was always shared, but the GIL was serializing access by accident. couldn't repro from macOS since that path uses a stack-local.
inline comment below has a sketch.

rest looks good - tested locally, ssh repro from #565 passes here, pool restructure holds up under concurrent acquire stress.

Comment thread mssql_python/pybind/connection/connection.cpp
Comment thread tests/test_023_ssh_tunnel_gil_release.py
gargsaumya
gargsaumya previously approved these changes May 14, 2026
bewithgaurav
bewithgaurav previously approved these changes May 14, 2026
saurabh and others added 6 commits May 14, 2026 12:15
PR #541 released the GIL around SQLDriverConnect / SQLDisconnect /
SQLEndTran, but several SQLSetConnectAttr sites in connection.cpp
were left holding the GIL:

  - Connection::setAutocommit (SQL_ATTR_AUTOCOMMIT)
  - Connection::setAttribute (user-facing set_attr)
  - Connection::reset (SQL_ATTR_RESET_CONNECTION,
    SQL_ATTR_TXN_ISOLATION) — called by the pool on acquire.

setAutocommit is the call that hangs in the issue #565 repro: it
runs by default immediately after connect, and when the network
path goes through another Python thread (e.g. an in-process SSH
tunnel via paramiko + sshtunnel), that thread cannot make progress
while the GIL is held — deadlock.

Wrap each of these calls with py::gil_scoped_release, matching the
pattern PR #541 introduced for the other ODBC entry points.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Routes mssql_python.connect()+SELECT through a pure-Python TCP
forwarder running in a subprocess, with a hard external watchdog
(Popen.communicate(timeout=...)).

The forwarder's _pipe loop calls dst.sendall(...) on every chunk;
that bound-method dispatch needs the GIL, which is exactly the
deadlock condition from issue #565: with the GIL held across
SQLSetConnectAttr(SQL_ATTR_AUTOCOMMIT), the forwarder thread
cannot run, the driver waits forever for a reply, and the
subprocess hangs.

Verified that the test fails (in 30s, with the issue-#565 diagnostic
message) against the unfixed binary on main, and passes (~0.2s)
once the GIL release on setAutocommit/setAttribute/reset is in place.

Linux-only and runs in the default functional suite (not stress).
A subprocess + external watchdog is required because once the
worker thread deadlocks while holding the GIL, the entire Python
interpreter is starved — an in-process watchdog thread cannot
make progress either.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The PR's GIL-release fix in Connection::reset() creates a mutex/GIL
lock-ordering deadlock when multiple threads acquire pooled connections
concurrently:
- Thread A holds pool mutex, releases GIL inside reset()
- Thread B gets GIL, blocks on pool mutex
- Thread A's ODBC call returns, blocks reacquiring GIL → deadlock

Fix: Pop one candidate at a time from the pool under the mutex, then
validate (isAlive + reset) outside the mutex. This follows the same
pattern already used for connect() and disconnect() in the pool.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CI environment does not have mssql_python installed system-wide;
pytest's path manipulation makes it importable in the test process but
the spawned subprocess does not inherit that. Pass sys.path via
PYTHONPATH so the subprocess can locate mssql_python.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use os.path.abspath(__file__) for subprocess re-exec robustness
- Add comments clarifying _current_size tracks reserved capacity
- Document the pool-full race window as an acceptable trade-off

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…L release

On the Windows string path and the bytes path, ptr pointed into a member
field (wstrStringBuffer / strBytesBuffer).  Releasing the GIL around the
ODBC call allowed another thread to overwrite the member and trigger a
reallocation, leaving ptr dangling.

Copy to a stack-local variable before releasing the GIL — the same
pattern the macOS/Linux branch already uses via its SQLWCHAR conversion.

Credit: @bewithgaurav (PR review)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@saurabh500 saurabh500 dismissed stale reviews from bewithgaurav and gargsaumya via f1704af May 14, 2026 12:21
@saurabh500 saurabh500 force-pushed the dev/saurabh/fix-565-gil-release-conn-attrs branch from c6c8eef to f1704af Compare May 14, 2026 12:21
@gargsaumya gargsaumya merged commit 75ee747 into main May 14, 2026
30 of 31 checks passed
@gargsaumya gargsaumya mentioned this pull request May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants